Correcting Jaccard and other similarity indices for chance agreement in cluster analysis
نویسندگان
چکیده
Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre the expectations cannot be easily found. We show how such similarity indices can be expressed as functions of other indices and expectations found by approximations such that approximate correction is possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting correction of similarity indices using structured and unstructured data generated from bivariate normal distributions.
منابع مشابه
Unsupervised Feature Selection by Means of External Validity Indices
Feature selection for unsupervised data is a difficult task because a reference partition is not available to evaluate the relevance of the features. Recently, different proposals of methods for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. Theses indices are independent of the...
متن کاملGenetic Variation within Iranian Iris Species Using Morphological Traits
Iris belongs toIridaceae family and it is monocotyledon. Iris is one of the important ornamental and medicinal plants. 34 iris genotypes (14 species) collected from different provinces of Iran were planted at National Institute of Ornamental Plants (NIOP) Iran. All of the species evaluated for 15 quantitative traits and 30 qualitative traits. Results showed that the highest positive correlation...
متن کاملComparison of Similarity Coefficients used for Cluster Analysis with Amplified Fragment Length Polymorphism Markers in the Silkworm, Bombyx mori
Establishing accurate genetic similarity and dissimilarity between individuals is an essential and decisive point for clustering and analyzing inter and intra population diversity because different similarity and dissimilarity indices may yield contradictory outcomes. We assessed the variations caused by three commonly used similarity coefficients including Jaccard, Sorensen-Dice and Simple mat...
متن کاملارائه یک الگوریتم خوشه بندی برای داده های دسته ای با ترکیب معیارها
Clustering is one of the main techniques in data mining. Clustering is a process that classifies data set into groups. In clustering, the data in a cluster are the closest to each other and the data in two different clusters have the most difference. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...
متن کاملComparison of Clustering Approaches for Gene Expression Data
Clustering algorithms have been used to divide genes into groups according to the degree of their expression similarity. Such a grouping may suggest that the respective genes are correlated and/or co-regulated, and subsequently indicates that the genes could possibly share a common biological role. In this paper, four clustering algorithms are investigated: k-means, cut-clustering, spectral and...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Adv. Data Analysis and Classification
دوره 5 شماره
صفحات -
تاریخ انتشار 2011